SeSAMe implements inference of sex, age, ethnicity. These are valuable information for checking the integrity of the experiment and detecting sample swaps.
Sex is inferred based on our curated X-linked probes and Y chromosome probes excluding pseudo-autosomal regions and XCI escapes.
Human:
Mouse:
## Loading required namespace: e1071
## [1] Male
## Levels: Female Male
Ethnicity is inferred using a random forest model trained based on both the built-in SNPs (rs probes) and channel-switching Type-I probes.
## Loading required namespace: randomForest
## [1] "WHITE"
Horvath 353 Model:
SeSAMe provides age regression a la the Horvath 353 model.
## [1] 84.13913
Mouse Array 347 Model:
The age of the mouse can be predicted using the predictMouseAgeInMonth function. This looks for overlapping probes and estimates age using an aging model built from 347 MM285 probes. The function returns a numeric output of age in months. The model is most accurate with SeSAMe preprocessing. Here’s an example.
## [1] 1.413134
This indicates thaat this mouse is approximately 1.41 months old.
SeSAMe performs copy number variation in three steps: 1) normalizes the signal intensity using a copy-number-normal data set; 2) groups adjacent probes into bins; 3) runs DNAcopy internally to group bins into segments.
To visualize segmentation in SeSAMe,
SeSAMe estimates leukocyte fraction using a two-component model.This function works for samples whose targeted cell-of-origin is not related to white blood cells.
## [1] 0.2007592
The goal of data sanitization is to modifiy IDAT files in place, so they can be released to public domain without privacy leak. This will be achieved by deIdentification.
Let’s take DNA methylation data from the HM450 platform for example.
tmp = tempdir()
res_grn = sesameAnno_get("Test/3999492009_R01C01_Grn.idat", dest_dir=tmp)
res_red = sesameAnno_get("Test/3999492009_R01C01_Red.idat", dest_dir=tmp)This first method of deIdentification masks SNP probe intensity mean by zero. As a consequence, the allele frequency will be 0.5.
deIdentify(res_grn$dest_file, sprintf("%s/deidentified_Grn.idat", tmp))
deIdentify(res_red$dest_file, sprintf("%s/deidentified_Red.idat", tmp))
betas1 = getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas2 = getBetas(readIDATpair(sprintf("%s/deidentified", tmp)))
head(betas1[grep('rs',names(betas1))])
head(betas2[grep('rs',names(betas2))])Note that before deIdentify, the rs values will all be different. After deIdentify, the rs values will all be masked at an intensity of 0.5.
This second method of deIdentification will scramble the intensities using a secret key to help formalize a random number. Therefore, randomize needs to be set to TRUE.
my_secret <- 13412084
set.seed(my_secret)
deIdentify(res_grn$dest_file,
sprintf("%s/deidentified_Grn.idat", tmp), randomize=TRUE)
my_secret <- 13412084
set.seed(my_secret)
deIdentify(res_red$dest_file,
sprintf("%s/deidentified_Red.idat", tmp), randomize=TRUE)
betas1 = getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas2 = getBetas(readIDATpair(sprintf("%s/deidentified", tmp)))
head(betas1[grep('rs',names(betas1))])
head(betas2[grep('rs',names(betas2))]) Note that the rs values are scrambled after deIdentify.
To restore order of the deIdentified intensities, one can re-identify IDATs. The reIdentify function can thus restore the scrambled SNP intensities.
my_secret <- 13412084
set.seed(my_secret)
reIdentify(sprintf("%s/deidentified_Grn.idat", tmp),
sprintf("%s/reidentified_Grn.idat", tmp))
my_secret <- 13412084
set.seed(my_secret)
reIdentify(sprintf("%s/deidentified_Red.idat", tmp),
sprintf("%s/reidentified_Red.idat", tmp))
betas1 = getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas2 = getBetas(readIDATpair(sprintf("%s/reidentified", tmp)))
head(betas1[grep('rs',names(betas1))])
head(betas2[grep('rs',names(betas2))]) Note that reIdentify restored the values. Subsequently, they are the same as betas1.
## R Under development (unstable) (2021-11-09 r81170)
## Platform: x86_64-apple-darwin20.6.0 (64-bit)
## Running under: macOS Big Sur 11.6.2
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.25.3 Biobase_2.55.0
## [3] GenomicRanges_1.47.6 GenomeInfoDb_1.31.6
## [5] IRanges_2.29.1 S4Vectors_0.33.11
## [7] MatrixGenerics_1.7.0 matrixStats_0.61.0
## [9] sesame_1.13.38 sesameData_1.13.33
## [11] ExperimentHub_2.3.5 AnnotationHub_3.3.9
## [13] BiocFileCache_2.3.4 dbplyr_2.1.1
## [15] BiocGenerics_0.41.2 rmarkdown_2.11
## [17] R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5
## [3] filelock_1.0.2 RColorBrewer_1.1-2
## [5] httr_1.4.2 tools_4.2.0
## [7] bslib_0.3.1 utf8_1.2.2
## [9] DBI_1.1.2 colorspace_2.0-3
## [11] withr_2.5.0 tidyselect_1.1.2
## [13] preprocessCore_1.57.0 bit_4.0.4
## [15] curl_4.3.2 compiler_4.2.0
## [17] cli_3.2.0 DelayedArray_0.21.2
## [19] sass_0.4.1 scales_1.1.1
## [21] randomForest_4.7-1 readr_2.1.2
## [23] proxy_0.4-26 rappdirs_0.3.3
## [25] stringr_1.4.0 digest_0.6.29
## [27] XVector_0.35.0 pkgconfig_2.0.3
## [29] htmltools_0.5.2 fastmap_1.1.0
## [31] rlang_1.0.2 RSQLite_2.2.11
## [33] shiny_1.7.1 jquerylib_0.1.4
## [35] generics_0.1.2 jsonlite_1.8.0
## [37] wheatmap_0.2.0 BiocParallel_1.29.18
## [39] dplyr_1.0.8 RCurl_1.98-1.6
## [41] magrittr_2.0.2 GenomeInfoDbData_1.2.7
## [43] Matrix_1.4-0 Rcpp_1.0.8.3
## [45] munsell_0.5.0 fansi_1.0.3
## [47] lifecycle_1.0.1 stringi_1.7.6
## [49] yaml_2.3.5 MASS_7.3-55
## [51] zlibbioc_1.41.0 plyr_1.8.7
## [53] grid_4.2.0 blob_1.2.2
## [55] parallel_4.2.0 promises_1.2.0.1
## [57] crayon_1.5.1 lattice_0.20-45
## [59] Biostrings_2.63.2 hms_1.1.1
## [61] KEGGREST_1.35.0 knitr_1.37
## [63] pillar_1.7.0 reshape2_1.4.4
## [65] glue_1.6.2 BiocVersion_3.15.0
## [67] evaluate_0.15 BiocManager_1.30.16
## [69] png_0.1-7 vctrs_0.3.8
## [71] tzdb_0.2.0 httpuv_1.6.5
## [73] gtable_0.3.0 purrr_0.3.4
## [75] assertthat_0.2.1 cachem_1.0.6
## [77] ggplot2_3.3.5 xfun_0.29
## [79] mime_0.12 xtable_1.8-4
## [81] e1071_1.7-9 later_1.3.0
## [83] class_7.3-20 tibble_3.1.6
## [85] AnnotationDbi_1.57.1 memoise_2.0.1
## [87] ellipsis_0.3.2 interactiveDisplayBase_1.33.0
## [89] BiocStyle_2.23.1